This document describes the application of the PRS residualization approach to white subjects in the REGARDS data. Seven covariates are considered: alcohol use, gender, age, smoking, education (categorical), income, and weight. Eight outcomes and their polygenic risk scores are considered: diastolic blood pressure (‘PGS000302’), glucose (‘PGS000684’), LDL (‘PGS000061’), systolic blood pressure (‘PGS000301’), total cholesterol (‘PGS000062’), triglycerides (‘PGS000066’), coronary artery disease (‘PGS000011’), and height (‘PGS000297’). All of these outcomes are continuous except coronary artery disease (CAD), which is modeled with logistic regression. For CAD, Pearson residuals will be used.
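As a minimal sketch of what the Pearson residuals for the CAD model look like, the block below fits a logistic regression on simulated data and checks the manual formula against `residuals(fit, type = "pearson")`. The variable names (`pgs`, `age`, `cad`) and the simulated effect sizes are illustrative assumptions, not the REGARDS variables.

```r
set.seed(42)
n <- 300

# Hypothetical stand-ins for the CAD PGS, one covariate, and the binary outcome
pgs <- rnorm(n)
age <- rnorm(n, mean = 65, sd = 9)
cad <- rbinom(n, size = 1, prob = plogis(-1 + 0.5 * pgs + 0.03 * (age - 65)))

fit <- glm(cad ~ pgs + age, family = binomial)

# Pearson residual: (observed - fitted) / sqrt(fitted * (1 - fitted))
p_hat <- fitted(fit)
pearson_manual <- (cad - p_hat) / sqrt(p_hat * (1 - p_hat))
pearson_builtin <- residuals(fit, type = "pearson")

# The two versions should agree up to numerical tolerance
all.equal(unname(pearson_manual), unname(pearson_builtin))
```

Unlike raw residuals from a logistic model, these are scaled by the estimated binomial standard deviation, which puts them on a footing closer to the residuals from the linear models.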
First, models will be fit with all seven covariates, and the residuals will be analyzed for structure. Then, models will be fit holding out each of the covariates in turn. All sets of residuals from leave-one-out models will be assessed via PCA and k-means clustering. For dichotomous or trichotomous covariates (smoking, alcohol use, gender), particular attention will be paid to k-means clustering.
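The full and leave-one-out fits described above could be organized roughly as follows. This is a sketch on simulated data with only two outcomes and three covariates; the helper `loo_residuals()` and all column names are hypothetical, and linear models are used for both outcomes to keep the example short.

```r
set.seed(1)
n <- 200

# Simulated stand-in data: two outcomes, their PGSs, and three covariates
dat <- data.frame(
  pgs_glucose = rnorm(n),
  pgs_ldl     = rnorm(n),
  age         = rnorm(n, 65, 9),
  gender      = factor(sample(c("F", "M"), n, replace = TRUE)),
  weight      = rnorm(n, 80, 15)
)
dat$glucose <- 100 + 4 * dat$pgs_glucose + 0.1 * dat$weight + rnorm(n, sd = 10)
dat$ldl     <- 110 + 8 * dat$pgs_ldl + 5 * (dat$gender == "M") + rnorm(n, sd = 25)

outcomes   <- c(glucose = "pgs_glucose", ldl = "pgs_ldl")  # outcome -> its PGS
covariates <- c("age", "gender", "weight")

# Fit one model per outcome, optionally holding out one covariate,
# and return an n x (number of outcomes) matrix of residuals
loo_residuals <- function(dat, outcomes, covariates, drop = NULL) {
  keep <- setdiff(covariates, drop)
  sapply(names(outcomes), function(y) {
    f <- reformulate(c(outcomes[[y]], keep), response = y)
    residuals(lm(f, data = dat))
  })
}

res_full    <- loo_residuals(dat, outcomes, covariates)
res_by_drop <- lapply(setNames(covariates, covariates), function(v) {
  loo_residuals(dat, outcomes, covariates, drop = v)
})
```

Each element of `res_by_drop` is the residual matrix from the models that omit the named covariate, which is the input to the PCA and k-means steps.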
Subjects with missing data for any of the covariates, PGSs, or outcomes will be dropped from the analysis. This leaves us with a sample size of 1413 for a complete-case analysis.
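A complete-case filter of this kind can be sketched with `complete.cases()`. The toy data frame and column names below are made up for illustration.

```r
# Toy data with one missing value in each of three different rows
dat <- data.frame(
  id      = 1:6,
  glucose = c(95, NA, 102, 110, 88, 99),
  pgs     = c(0.1, -0.4, NA, 1.2, 0.3, -0.8),
  smoke   = factor(c("Never", "Past", "Current", NA, "Never", "Past"))
)

# Keep only rows with no missing value in any variable used by the models
needed   <- c("glucose", "pgs", "smoke")
complete <- dat[complete.cases(dat[, needed]), ]

nrow(complete)  # rows 1, 5, and 6 remain
```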
As a case study, we first fit the multiple regression for glucose using the PGS and all seven covariates. Some diagnostic plots are below. The fit appears to deviate moderately from the modeling assumptions of linear regression. Because we do not intend to perform inference on the regression coefficients, this does not seem like a major issue. Indeed, since we hope that the residuals will show some type of ‘bimodal’ structure in some cases, it may be preferable that the residuals are not perfectly normal.
Investigate the variance inflation factor for the seven covariates.
##               GVIF Df GVIF^(1/(2*Df))
## PGS000684 1.004006  1        1.002001
## Gender_x  1.159018  1        1.076577
## Age_x     1.190480  1        1.091091
## Alc_Use   1.333417  2        1.074587
## Income    1.488390  1        1.219996
## Smoke     1.280848  2        1.063835
## ED_Cat    1.314990  3        1.046696
## Weight    1.091646  1        1.044818
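The column names in the table above match what `car::vif()` prints for models that include multi-degree-of-freedom terms. As a sketch of what the generalized VIF measures, the block below computes it in base R from the correlation matrix of the coefficient estimates, on simulated data. The function `gvif()` and the variable names are assumptions made for illustration.

```r
set.seed(1)
n <- 200
dat <- data.frame(
  pgs   = rnorm(n),
  age   = rnorm(n, 65, 9),
  smoke = factor(sample(c("Current", "Never", "Past"), n, replace = TRUE))
)
dat$glucose <- 100 + 5 * dat$pgs + 0.2 * dat$age + rnorm(n, sd = 10)
fit <- lm(glucose ~ pgs + age + smoke, data = dat)

# GVIF for a term: det(R11) * det(R22) / det(R), where R is the correlation
# matrix of the non-intercept coefficient estimates, R11 is the block for the
# term's own coefficients, and R22 is the block for all other coefficients
gvif <- function(fit) {
  asg  <- attr(model.matrix(fit), "assign")  # maps coefficients to terms
  keep <- asg != 0                           # drop the intercept
  R    <- cov2cor(vcov(fit)[keep, keep])
  asg  <- asg[keep]
  labs <- attr(terms(fit), "term.labels")
  out <- t(sapply(seq_along(labs), function(j) {
    idx <- which(asg == j)
    g   <- det(R[idx, idx, drop = FALSE]) *
           det(R[-idx, -idx, drop = FALSE]) / det(R)
    df  <- length(idx)
    c(GVIF = g, Df = df, "GVIF^(1/(2*Df))" = g^(1 / (2 * df)))
  }))
  rownames(out) <- labs
  out
}

gvif(fit)
```

The last column rescales the GVIF so that terms with different degrees of freedom (for example a three-level factor versus a continuous covariate) are comparable.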
Calculate the coefficient of determination for each of the models to give a sense of how predictive each covariate is. For the logistic CAD models, use AUROC.
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
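The messages above are the ones `pROC::roc()` prints when it infers levels and direction (its `quiet = TRUE` argument suppresses them). As a dependency-free sketch of the two summary measures, the block below computes R-squared for a linear model and a rank-based AUROC for a logistic model on simulated data. The helper `auroc()` and all variable names are illustrative assumptions.

```r
set.seed(7)
n   <- 400
pgs <- rnorm(n)
age <- rnorm(n, 65, 9)
sbp <- 120 + 3 * pgs + 0.5 * (age - 65) + rnorm(n, sd = 12)
cad <- rbinom(n, 1, plogis(-1 + 0.6 * pgs + 0.04 * (age - 65)))

# Coefficient of determination for a continuous outcome
r2 <- summary(lm(sbp ~ pgs + age))$r.squared

# AUROC via the Mann-Whitney rank formula: the probability that a randomly
# chosen case has a higher score than a randomly chosen control (ties count 1/2)
auroc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

fit_cad <- glm(cad ~ pgs + age, family = binomial)
auc     <- auroc(fitted(fit_cad), cad)

c(r2 = r2, auc = auc)
```

Both measures here are computed in-sample, so they describe fit rather than out-of-sample prediction.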
Assess clustering of the residuals from the full model using the gap statistic to determine the preferred number of clusters. Plot the first two principal components and look for structure. Residuals will be standardized before performing any modeling (to account for the different scales of the outcome variables).
## Warning: did not converge in 10 iterations
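The warning above is what `kmeans()` emits when it reaches its default limit of `iter.max = 10`. A sketch of the standardize, gap statistic, and PCA steps is given below using `cluster::clusGap()` on a simulated residual matrix; raising `iter.max` and `nstart` is one possible way to reduce the convergence warnings. The matrix `resid_mat`, its column names, and the tuning values are assumptions, not the settings used in this analysis.

```r
library(cluster)  # recommended package, bundled with standard R installations

set.seed(123)
n <- 150

# Hypothetical residual matrix: one column per outcome, on different scales
resid_mat <- cbind(
  glucose = rnorm(n, sd = 20),
  ldl     = rnorm(n, sd = 35),
  height  = rnorm(n, sd = 6)
)

# Standardize each residual vector to mean 0 and standard deviation 1
z <- scale(resid_mat)

# Gap statistic for k = 1..5; iter.max and nstart are passed through to kmeans()
gap <- clusGap(z, FUNcluster = kmeans, K.max = 5, B = 20,
               iter.max = 100, nstart = 10, verbose = FALSE)

# Preferred number of clusters under the "firstSEmax" rule
k_hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")

# First two principal components for plotting
pca       <- prcomp(z)
pc_scores <- pca$x[, 1:2]
# plot(pc_scores) would show the scatter to inspect for structure
```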
Assess clustering of the residuals from the model without gender using the gap statistic to determine the preferred number of clusters. Plot the first two principal components and look for structure. Interestingly, the results appear to be ‘better’ when we do not scale the eight residual vectors before performing clustering. I believe that normalizing is still correct, but we may want to discuss this.
## Warning: did not converge in 10 iterations
## [1] "Adjusted rand index, no normalization: 0.001"
## [1] "No normalization, no gender table of clustering results"
##
##       F   M
##   1 402 613
##   2 163 235
## [1] "Adjusted rand index, normalized: 0.012"
## [1] "Normalized, no gender table of clustering results"
##
##       F   M
##   1 306 367
##   2 259 481
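The adjusted Rand index can be recomputed directly from a cluster-by-category contingency table. The sketch below applies the standard formula to the two gender tables printed above; the helper name `ari_from_table()` is an assumption, and the report itself may have used a package function instead.

```r
# Adjusted Rand index from a contingency table of cluster vs. known category
ari_from_table <- function(tab) {
  n           <- sum(tab)
  pairs_cells <- sum(choose(tab, 2))           # pairs that agree in both partitions
  pairs_rows  <- sum(choose(rowSums(tab), 2))  # pairs in the same cluster
  pairs_cols  <- sum(choose(colSums(tab), 2))  # pairs in the same category
  expected    <- pairs_rows * pairs_cols / choose(n, 2)
  max_index   <- (pairs_rows + pairs_cols) / 2
  (pairs_cells - expected) / (max_index - expected)
}

# Counts copied from the two tables above (rows: cluster, columns: gender)
gender_raw <- matrix(c(402, 163, 613, 235), nrow = 2,
                     dimnames = list(cluster = c("1", "2"), gender = c("F", "M")))
gender_scaled <- matrix(c(306, 259, 367, 481), nrow = 2,
                        dimnames = list(cluster = c("1", "2"), gender = c("F", "M")))

round(ari_from_table(gender_raw), 3)     # reported above as 0.001
round(ari_from_table(gender_scaled), 3)  # reported above as 0.012
```

Values this close to zero indicate that the cluster assignments agree with gender little better than chance, with or without normalization.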
Assess residuals from the model without smoking. Again, the evidence for two or three clusters (likely the preferred outcome) is stronger without normalizing.
## Warning: did not converge in 10 iterations
## [1] "Adjusted rand index, no normalization: 0.008"
## [1] "No normalization, no smoking table of clustering results"
##
##     Current Never Past
##   1      78   297  248
##   2      52   104  126
##   3      66   211  231
## [1] "Adjusted rand index, normalized: 0.007"
## [1] "normalized, no smoking table of clustering results"
##
##     Current Never Past
##   1      75   293  233
##   2      41    94  140
##   3      80   225  232
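One possible reason the normalized and unnormalized results differ is that, without scaling, k-means distances are dominated by whichever residual vector has the largest variance. The sketch below illustrates this on simulated data with two residual columns on very different scales. The data, the helper `cluster_vs()`, and the effect sizes are hypothetical and are not meant to reproduce the tables above.

```r
set.seed(2024)
n     <- 300
smoke <- factor(sample(c("Current", "Never", "Past"), n, replace = TRUE))

# Two hypothetical residual vectors: one high-variance and unrelated to smoking,
# one low-variance with a shift for current smokers
resid_mat <- cbind(
  triglycerides = rnorm(n, sd = 80),
  height        = rnorm(n, sd = 6) + 4 * (smoke == "Current")
)

# Cluster with as many centers as there are categories and cross-tabulate
cluster_vs <- function(x, groups, k = nlevels(groups)) {
  km <- kmeans(x, centers = k, nstart = 25, iter.max = 100)
  table(cluster = km$cluster, group = groups)
}

tab_raw    <- cluster_vs(resid_mat, smoke)
tab_scaled <- cluster_vs(scale(resid_mat), smoke)

# Share of total variance carried by each column before scaling
col_var   <- apply(resid_mat, 2, var)
var_share <- col_var / sum(col_var)
var_share
```

When one column carries almost all of the variance, the unscaled clustering is effectively a one-dimensional split on that column, which can look cleaner in a gap statistic without saying anything about the held-out covariate.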
Assess residuals from the model without alcohol.
## Warning: did not converge in 10 iterations
## [1] "Adjusted rand index, no normalization: 0.003"
## [1] "No normalization, no alcohol table of clustering results"
##
##     Current Never Past
##   1     139    88   49
##   2     339   164  119
##   3     292   134   89
## [1] "Adjusted rand index, normalized: 0.011"
## [1] "normalized, no alcohol table of clustering results"
##
##     Current Never Past
##   1     314   137   96
##   2     131    79   69
##   3     325   170   92
Assess residuals from the model without weight. There appears to be an oddly large number of duplicate values for weight; this warrants further investigation.
## Warning: did not converge in 10 iterations
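A quick way to follow up on repeated values in a numeric covariate is to count duplicates and list the most frequent values. The vector below is toy data, not the REGARDS weight variable.

```r
# Toy weight vector with some repeated values
weight <- c(150, 150, 162.5, 180, 180, 180, 201.3, 175)

n_dup    <- sum(duplicated(weight))   # observations that repeat an earlier value
n_unique <- length(unique(weight))    # number of distinct values

# Most frequent values, which can reveal rounding or a default fill-in value
top_values <- head(sort(table(weight), decreasing = TRUE), 3)

n_dup
n_unique
top_values
```

The same check applies to any covariate suspected of taking only a handful of distinct values.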
Assess residuals from the model without income. Look at only normalized results for now. There appear to be very few unique values for income.
## Warning: did not converge in 10 iterations
Assess residuals from the model without age. Look at only normalized results for now.
## Warning: did not converge in 10 iterations
Assess residuals from the model without education. Look at only normalized results for now. Compare 4-center k-means clustering to the four education categories with the adjusted Rand index.
## Warning: did not converge in 10 iterations
## [1] "Adjusted rand index, normalized: 0.002"
## [1] "normalized, no education table of clustering results"
##
##     College graduate and above High school graduate Less than high school Some college
##   1                        168                  104                    29          120
##   2                        172                   95                    34           93
##   3                         98                   49                    29           75
##   4                        129                   83                    30          105